Skip to content

Implement getSchemas() for SEA#802

Merged
jayantsing-db merged 10 commits into
mainfrom
jayantsing-db/show-schemas-client
May 22, 2025
Merged

Implement getSchemas() for SEA#802
jayantsing-db merged 10 commits into
mainfrom
jayantsing-db/show-schemas-client

Conversation

@jayantsing-db

@jayantsing-db jayantsing-db commented Apr 21, 2025

Copy link
Copy Markdown
Collaborator

Description

  • Retrieve all catalogs.
  • For each catalog, fetch schemas in parallel and merge the results.
  • On the e2-dogfood environment, which has around 9,000 schemas, this approach took approximately 12 seconds—compared to 9 seconds with the existing driver—an acceptable difference.
  • The runtime update for SHOW SCHEMAS IN ALL CATALOGS is currently under BEHAVE/compatibility review. Meanwhile, this client-side implementation serves as a temporary solution.
  • Introduced a JdbcThreadUtil class to simplify multi-threaded operations in JDBC. To ensure correctness, thread context must carry the appropriate connection-related details.

Testing

  • End to end testing
  • Unit tests

Additional Notes to the Reviewer

@jayantsing-db jayantsing-db requested a review from Copilot April 21, 2025 11:18
@jayantsing-db jayantsing-db marked this pull request as ready for review April 21, 2025 11:19

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements the getSchemas() method for the SEA client by retrieving catalogs and fetching their schemas in parallel, using a newly introduced JdbcThreadUtils for multi-threaded operations. Key changes include:

  • Introducing JdbcThreadUtils with parallelMap and parallelFlatMap methods for concurrent task execution.
  • Modifying DatabricksDatabaseMetaData#getSchemas() to fetch catalogs and schemas concurrently.
  • Adding comprehensive tests for JdbcThreadUtils in JdbcThreadUtilsTest.java.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
src/test/java/com/databricks/jdbc/common/util/JdbcThreadUtilsTest.java New tests covering multiple execution paths for the JdbcThreadUtils methods.
src/main/java/com/databricks/jdbc/model/telemetry/enums/DatabricksDriverErrorCode.java Updated error code ordering with addition of OPERATION_TIMEOUT_ERROR.
src/main/java/com/databricks/jdbc/common/util/JdbcThreadUtils.java Introduced parallelMap and parallelFlatMap; contains error handling for interruptions and execution errors.
src/main/java/com/databricks/jdbc/api/impl/DatabricksDatabaseMetaData.java Updated getSchemas() to process catalog schemas in parallel using JdbcThreadUtils.
Comments suppressed due to low confidence (2)

src/main/java/com/databricks/jdbc/common/util/JdbcThreadUtils.java:71

  • The error code THREAD_INTERRUPTED_ERROR is referenced but not defined in DatabricksDriverErrorCode. Please add an appropriate definition or update to an existing error code.
throw new DatabricksSQLException("Parallel execution interrupted", e, DatabricksDriverErrorCode.THREAD_INTERRUPTED_ERROR);

src/main/java/com/databricks/jdbc/common/util/JdbcThreadUtils.java:80

  • The error code INVALID_STATE is used but not defined in DatabricksDriverErrorCode. Please add it to the enum or replace it with an existing, appropriate error code.
throw new DatabricksSQLException("Error in parallel execution", e, DatabricksDriverErrorCode.INVALID_STATE);

Comment thread src/main/java/com/databricks/jdbc/api/impl/DatabricksDatabaseMetaData.java Outdated

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements the getSchemas() method for the SEA client by fetching catalogs and concurrently retrieving schema information, while introducing the JdbcThreadUtils for parallel operations. Key changes include:

  • Adding JdbcThreadUtils and associated unit tests for parallel processing.
  • Enhancing error handling and telemetry error codes.
  • Updating the getSchemas() method in DatabricksDatabaseMetaData to use parallel processing.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
src/main/java/com/databricks/jdbc/common/util/JdbcThreadUtils.java Introduces parallelMap and parallelFlatMap methods with proper thread-context handling.
src/test/java/com/databricks/jdbc/common/util/JdbcThreadUtilsTest.java Adds comprehensive unit tests for parallel execution, exception, and timeout scenarios.
src/main/java/com/databricks/jdbc/model/telemetry/enums/DatabricksDriverErrorCode.java Inserts a new error code (OPERATION_TIMEOUT_ERROR) and reorders error codes.
src/main/java/com/databricks/jdbc/api/impl/DatabricksDatabaseMetaData.java Updates getSchemas() to fetch catalogs and execute schema retrieval in parallel.
Comments suppressed due to low confidence (1)

src/main/java/com/databricks/jdbc/common/util/JdbcThreadUtils.java:73

  • The error code THREAD_INTERRUPTED_ERROR is referenced here but is not defined in the DatabricksDriverErrorCode enum. Consider adding the THREAD_INTERRUPTED_ERROR to the enum or using an existing error code that reflects an interruption.
throw new DatabricksSQLException("Parallel execution interrupted", e, DatabricksDriverErrorCode.THREAD_INTERRUPTED_ERROR);

Comment thread src/main/java/com/databricks/jdbc/api/impl/DatabricksDatabaseMetaData.java Outdated
Comment thread src/main/java/com/databricks/jdbc/common/util/JdbcThreadUtils.java Outdated
Comment thread src/main/java/com/databricks/jdbc/api/impl/DatabricksDatabaseMetaData.java Outdated

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a basis for selecting this particular value for the timeout? Had the same concerns for the max threads variable, but that is configurable.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think i added based on if timeout is long enough and below is one data-point:

On the e2-dogfood environment, which has around 9,000 schemas, this approach took approximately 12 seconds—compared to 9 seconds with the existing driver—an acceptable difference.

Comment thread src/main/java/com/databricks/jdbc/api/impl/DatabricksDatabaseMetaData.java Outdated

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how are we testing parallel execute here?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this particular test method just tests the parallel execute with a single item/task testParallelExecuteWithSingleItem. But there are other test methods that tests parallel execute with multiple items.

Testing methodology:

  • parallel map is executed over a set of items
  • for each item, the task is to create a new upper case string String::toUpperCase
  • we assert the upper-case

Comment thread src/main/java/com/databricks/jdbc/api/impl/DatabricksDatabaseMetaData.java Outdated
@jayantsing-db jayantsing-db merged commit aa55100 into databricks:main May 22, 2025
15 of 16 checks passed
@jayantsing-db jayantsing-db deleted the jayantsing-db/show-schemas-client branch May 22, 2025 07:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants